library(readr)library(dplyr, warn.conflicts=FALSE) # This is to suppress the warning message, please feel free to switch off `warn.conflicts=FALSE` if you would like to see what happenslibrary(ggplot2)library(plotly, warn.conflicts=FALSE)# Import the data# Adapted from https://sl.bing.net/jPqB0uyR0mGmain_df <-read_csv("01-data-inputs/ms5130_w07_sales_by_county.csv", col_types =cols(.default ="c", sales ="d"))mapping_df <-read_csv("01-data-inputs/ms5130_w07_county_mapping.csv", col_types =cols(.default ="c"))fig1 <-plot_ly(main_df, x =~sales, color =~county, type ="box", colors =c("red", "blue"))fig1
This is not very helpful, due to inconsistencies. To resolve the inconsistencies, we can perform entity resolution using a mapping file.
The result:
Code
# Merge dataframesmerged_df <- dplyr::left_join(main_df, mapping_df, by ="county")# Replace inconsistent county namesmerged_df$county <-ifelse(is.na(merged_df$mapped_county), merged_df$county, merged_df$mapped_county)# Clean up (remove mapped_county column if needed)cleaned_df <- merged_df[, c("shop_id", "county", "sales")]# Save cleaned dataframe to a new CSV filewrite.csv(cleaned_df, "02-data-outputs/cleaned_data.csv", row.names =FALSE)fig2 <-plot_ly(cleaned_df, x =~sales, color =~county, type ="box")fig2
Something to consider: “Which county had the best sales figure?”